Variational Stochastic Gradient Descent

نویسنده

  • Michael Tetelman
چکیده

In Bayesian approach to probabilistic modeling of data we select a model for probabilities of data that depends on a continuous vector of parameters. For a given data set Bayesian theorem gives a probability distribution of the model parameters. Then the inference of outcomes and probabilities of new data could be found by averaging over the parameter distribution of the model, which is an intractable problem. In this paper we propose to use Variational Bayes (VB) to estimate Gaussian posterior of model parameters for a given Gaussian prior and Bayesian updates in a form that resembles SGD rules. It is shown that with incremental updates of posteriors for a selected sequence of data points and a given number of iterations the variational approximations are defined by a trajectory in space of Gaussian parameters, which depends on a starting point defined by priors of the parameter distribution, which are true hyper-parameters. The same priors are providing a weight decay or L2 regularization for the training. Then a selection of L2 regularization parameters and a number of iterations is completely defining a learning rule for VB SGD optimization, unlike other methods with momentum (Duchi et al., 2011; Kingma & Ba, 2014; Zeiler, 2012) that need selecting learning, regularization rates, etc., separately. We consider application of VB SGD for important practical case of fast training neural networks with very large data. While the speedup is achieved by partitioning data and training in parallel the resulting set of solutions obtained with VB SGD forms a Gaussian mixture. By applying VB SGD optimization to the Gaussian mixture we can merge multiple neural networks of same dimensions into a new single neural network that has almost the same performance as an original Gaussian mixture.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Importance Sampled Stochastic Optimization for Variational Inference

Variational inference approximates the posterior distribution of a probabilistic model with a parameterized density by maximizing a lower bound for the model evidence. Modern solutions fit a flexible approximation with stochastic gradient descent, using Monte Carlo approximation for the gradients. This enables variational inference for arbitrary differentiable probabilistic models, and conseque...

متن کامل

A Variational Analysis of Stochastic Gradient Algorithms

Stochastic Gradient Descent (SGD) is an important algorithm in machine learning. With constant learning rates, it is a stochastic process that, after an initial phase of convergence, generates samples from a stationary distribution. We show that SGD with constant rates can be effectively used as an approximate posterior inference algorithm for probabilistic modeling. Specifically, we show how t...

متن کامل

Tutorial on Variational Autoencoders

In just three years, Variational Autoencoders (VAEs) have emerged as one of the most popular approaches to unsupervised learning of complicated distributions. VAEs are appealing because they are built on top of standard function approximators (neural networks), and can be trained with stochastic gradient descent. VAEs have already shown promise in generating many kinds of complicated data, incl...

متن کامل

Gaussian Processes for Big Data through Stochastic Variational Inference

Gaussian processes [GP 10] are perhaps the dominant approach for inference on functions. They underpin a range of algorithms for regression, classification and unsupervised learning. Unfortunately, exact inference in a GP has complexity O(n) with storage demands of O(n) and this hinders application of these models for ‘big data’. Various approximate techniques have been suggested [see e.g. 1, 1...

متن کامل

Re-using gradient computations in automatic variational inference

Automatic variational inference has recently become feasible as a scalable inference tool for probabilistic programming. The state-of-the-art algorithms are stochastic in two respects: they use stochastic gradient descent to optimize an expectation that is estimated with stochastic approximation. The core computation of such algorithms involves evaluating the loss and its automatically differen...

متن کامل

Early Stopping as Nonparametric Variational Inference

We show that unconverged stochastic gradient descent can be interpreted as a procedure that samples from a nonparametric approximate posterior distribution. This distribution is implicitly defined by the transformation of an initial distribution by a sequence of optimization steps. By tracking the change in entropy over these distributions during optimization, we form a scalable, unbiased estim...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016